3 research outputs found

    Decoding billions of integers per second through vectorization

    Get PDF
    In many important applications -- such as search engines and relational database systems -- data is stored in the form of arrays of integers. Encoding and, most importantly, decoding of these arrays consumes considerable CPU time. Therefore, substantial effort has been made to reduce costs associated with compression and decompression. In particular, researchers have exploited the superscalar nature of modern processors and SIMD instructions. Nevertheless, we introduce a novel vectorized scheme called SIMD-BP128 that improves over previously proposed vectorized approaches. It is nearly twice as fast as the previously fastest schemes on desktop processors (varint-G8IU and PFOR). At the same time, SIMD-BP128 saves up to 2 bits per integer. For even better compression, we propose another new vectorized scheme (SIMD-FastPFOR) that has a compression ratio within 10% of a state-of-the-art scheme (Simple-8b) while being two times faster during decoding.Comment: For software, see https://github.com/lemire/FastPFor, For data, see http://boytsov.info/datasets/clueweb09gap

    Fine-grained document clustering via ranking and its application to social media analytics

    No full text
    Extracting valuable insights from a large volume of unstructured data such as texts through clustering analysis is paramount to many big data applications. However, document clustering is challenged by the computational complexity of the underlying methods and the high dimensionality of data, especially when the number of required clusters is large. A fine-grained clustering solution is required to understand a data set that represents heterogeneous topics such as social media data. This paper presents the Fine-Grained document Clustering via Ranking (FGCR) approach which leverages the search engine capability of handling big data efficiently. Ranking scores from a search engine are used to calculate dynamic clusters’ representations called loci in an unsupervised learning setting. Clustering decisions are efficiently made based on an optimal selection from a small subset of loci instead of the entire cluster set as in the conventional centroid-based clustering. A comprehensive empirical study on several social media data sets shows that FGCR is able to produce insightful and accurate fine-grained solution. Moreover, it is magnitudes faster and requires less computational resources compared to other state-of-the-art document clustering approaches
    corecore